4¶

Question:

Please do classification survive or not survive from titanic dataset titanic (you can get it from the public portal dataset). Please do some tasks Exploratory Data Analysis (EDA), explain the hyperparameter used. Please explain bias-tradeoff and how you implement in this case

Answer

In [ ]:
import pandas as pd
pd.set_option('future.no_silent_downcasting',True)

dt = pd.read_csv('./data.csv')
In [ ]:
# Descripbe Data Shape
print("Data Shape")
print(dt.shape)
print("--------------")

# Describe overall data
print("Data Info")
print(dt.info(memory_usage=False))
print("--------------")

print("Data Description")
print(dt.describe())
print("--------------")
Data Shape
(891, 12)
--------------
Data Info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)None
--------------
Data Description
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  
--------------
In [ ]:
dt.head()
Out[ ]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [ ]:
from ydata_profiling import ProfileReport

ProfileReport(dt, title="Profiling Report")
Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]
Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]
Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]
Out[ ]:

Data preprocessing¶

Used Column:

  • Survival : Y
  • pclass
  • sex
  • age
  • sibsp
  • parch
  • embarked
  • fare

Missing Value

  • Age => Fill with mean
  • Embarked => Fill with longest one
In [ ]:
dt
Out[ ]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

In [ ]:
df = dt.copy()

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# Feature Selection
df = df[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp','Parch', 'Fare', 'Embarked']]
df['Sex'] = df['Sex'].replace(['male','female'],[1,0])
df['Embarked'] = df['Embarked'].replace(['S','C','Q'],[0,1,2])

# Normalisation using min_max scaller
mms = MinMaxScaler()
for i in df.drop(columns=['Survived']).columns.to_list():
    df[i] = mms.fit_transform(df[[i]])

# Add Null value
# Age => Fill with mean :
#
df = df.fillna({
    "Age": 29,
    "Embarked": 0,
})

# split to training and test
x = df.drop(columns=['Survived'])
y = df['Survived'].astype('int')

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2)

Data processing¶

In [ ]:
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, f1_score, precision_score

%matplotlib inline

# DEFINE Hyperparamater
param_grid = {
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Deine Classifier
clf_t = DecisionTreeClassifier()

# Define cross validation strtegy
cv = KFold(n_splits=5, shuffle=True, random_state=42)
grid_search = GridSearchCV(clf_t, param_grid, cv=cv, scoring='accuracy')


grid_search.fit(x_train, y_train)

print("Best Hyperparamater", grid_search.best_params_)
print("Best Score", grid_search.best_score_)

# Using cross validation use hyperparamate
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=1,min_samples_split=2)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

print(confusion_matrix(y_test, y_pred))
print(f"Accuracy  : {accuracy_score(y_test, y_pred)}")
print(f"Precision : {precision_score(y_test, y_pred)}")
print(f"Recall    : {recall_score(y_test, y_pred)}")
print(f"F1-Score  : {f1_score(y_test, y_pred)}")


plot_tree(clf)
Best Hyperparamater {'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2}
Best Score 0.8230473751600511
[[91  9]
 [27 52]]
Accuracy  : 0.7988826815642458
Precision : 0.8524590163934426
Recall    : 0.6582278481012658
F1-Score  : 0.7428571428571429
Out[ ]:
[Text(0.5, 0.875, 'x[1] <= 0.5\ngini = 0.466\nsamples = 712\nvalue = [449, 263]'),
 Text(0.25, 0.625, 'x[0] <= 0.75\ngini = 0.397\nsamples = 253\nvalue = [69, 184]'),
 Text(0.125, 0.375, 'x[2] <= 0.026\ngini = 0.111\nsamples = 135\nvalue = [8, 127]'),
 Text(0.0625, 0.125, 'gini = 0.5\nsamples = 2\nvalue = [1, 1]'),
 Text(0.1875, 0.125, 'gini = 0.1\nsamples = 133\nvalue = [7, 126]'),
 Text(0.375, 0.375, 'x[5] <= 0.046\ngini = 0.499\nsamples = 118\nvalue = [61.0, 57.0]'),
 Text(0.3125, 0.125, 'gini = 0.488\nsamples = 95\nvalue = [40, 55]'),
 Text(0.4375, 0.125, 'gini = 0.159\nsamples = 23\nvalue = [21, 2]'),
 Text(0.75, 0.625, 'x[2] <= 0.076\ngini = 0.285\nsamples = 459\nvalue = [380, 79]'),
 Text(0.625, 0.375, 'x[3] <= 0.25\ngini = 0.415\nsamples = 17\nvalue = [5, 12]'),
 Text(0.5625, 0.125, 'gini = 0.0\nsamples = 11\nvalue = [0, 11]'),
 Text(0.6875, 0.125, 'gini = 0.278\nsamples = 6\nvalue = [5, 1]'),
 Text(0.875, 0.375, 'x[0] <= 0.25\ngini = 0.257\nsamples = 442\nvalue = [375, 67]'),
 Text(0.8125, 0.125, 'gini = 0.44\nsamples = 98\nvalue = [66.0, 32.0]'),
 Text(0.9375, 0.125, 'gini = 0.183\nsamples = 344\nvalue = [309, 35]')]
No description has been provided for this image